\chapter{Testing} \label{testChapter}
\section{Getting a Large Test Sample}
To gather data that could help in either supporting or rejecting the hypotheses, a series of tests was conducted in the last phase of the project. Initially, the aim was to gather a minimum of 30 test participants, since this is considered a large sample within the field of statistics \cite{statisticsBook}.

Based on experience from previous projects, the group knew that it would be hard to get a sample this large. The best solution seemed to be to test on other fourth-semester Medialogy students. However, the other students were also busy with their own projects, which made it harder to get them to participate in this one. Therefore, a small competition was set up to ensure that as many people as possible wanted to come and test. Often, small rewards such as a piece of cake or a soda are given for participating in a test like this. Since we were interested in getting 30+ test participants, we wanted an even bigger motivational factor. In the end, it was decided to offer a crate of beer to the group that was most successful in the test.

Even though the test was kept anonymous, the test participants would compete group-wise. Each participant would add to the overall score of his/her own group, and the group with the highest score would win the prize. At first, taking the average score of the group members seemed fairer, but in the end it was decided to take the total score instead, i.e. each member's score summed together. This ensured that a group did not just send a few of its best people; instead, the motivation was to have as many people as possible participate to add to the total score. A group participating with six people would thus have a better chance of winning than a group that only provided three or four people for the test.

\section{Three Test Phases}
The test was conducted as a \textit{before-and-after test} and consisted of three phases. The first phase measured the test participants' initial knowledge in order to establish a baseline to compare against. In the second phase, the test participants tried out the program. The third phase was similar to the first, just with different questions. Combined, the test took about 30 minutes to complete.

Because the test lasted an average of half an hour, which is a relatively long duration for a test like this, it was not possible to gather the 30+ test participants that were the original goal, since this would have taken up to 15 hours. Neither the group nor the test participants had that much time available. Instead, half of the goal was reached: 15 test participants in total.

The following will describe each phase in more depth.

\subsection{Phase 1 - The Before-Test}
It was important to establish a baseline to have something to compare with. Therefore, the first phase presented the test participants with 15 questions. Each question asked the participants to either listen to or look at an audio effect and try to identify its name. The test participants were handed a paper with all the possible audio effects, but they were not told whether or not an effect could be present multiple times.

The possible effects were:
\begin{itemize}
\item Distortion
\item Wah-Wah effect
\item Phaser
\item Chorus (Flanger)
\item Vibrato
\item Equalizer (Highpass/lowpass/bandpass filters)
\item Tremolo
\item Echo (Delay)
\end{itemize}

The names in parentheses also counted as correct answers, e.g. if a participant answered "lowpass filter", this was treated the same as answering "equalizer".

The questions were randomized before each test participant was brought in. In addition, three different versions of each audio effect were made. This ensured variety in the audio examples and made it possible to hear or see the same effect multiple times. The randomization was done by writing all the effects on small pieces of paper and drawing them from a hat one at a time.
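The draw-from-a-hat procedure can be sketched in code. The effect names and the three versions per effect are taken from the list above; the function name and the slip-naming format (e.g. \texttt{Echo2}) are illustrative:

```python
import random

EFFECTS = ["Distortion", "Wah-Wah", "Phaser", "Chorus",
           "Vibrato", "Equalizer", "Tremolo", "Echo"]
VERSIONS = 3  # three pre-made variants of every effect

def draw_questions(n_questions=15, seed=None):
    """Simulate drawing slips from a hat: every effect/version pair
    is one slip, and slips are drawn without replacement."""
    hat = [f"{effect}{v}" for effect in EFFECTS
           for v in range(1, VERSIONS + 1)]
    rng = random.Random(seed)
    rng.shuffle(hat)
    return hat[:n_questions]
```

Drawing without replacement means no specific version repeats within one participant's 15 questions, although the same effect can still appear up to three times.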

As stated, the participant would alternate between hearing and seeing an effect. Before each effect, the test participant would hear/see the original signal. It was important that the participant was first presented the unmodified signal to be able to compare to the modified signal. The modified signal would then loop until the participant gave an answer.

Figure \ref{fig:beforeAfter} shows two questions. In the first, the computer screen would be blank, since the participant had to listen to a piece of music. An effect would then be added, and he should try to guess which of the possible effects it might be.

In question two, a sine wave is displayed visually, but no sound is audible. A total of 8 audio questions and 7 visual questions were given to each test participant.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.75\textwidth]{images/Test/before_after}
\caption{Each question alternated between being auditory and visual.}
\label{fig:beforeAfter}
\end{figure}

\subsubsection{Choice of Audio}
It was decided to use a simple sine wave when the participant had to look at the waveform. From a physical standpoint, the sine wave is the simplest kind of waveform there is. It describes the ideal oscillation, and it produces a pure, simple tone that is easy to listen to and distinguish \cite{petter}. Other options would be to use a square wave, a triangle wave or a sawtooth wave. However, to keep the test as simple and short as possible, only the sine wave was used throughout all three phases of the test.

When hearing the audio, a short piece of music was used. It was chosen not to re-use the sine wave, since some of the effects, e.g. tremolo, vibrato, and echo, are hard to hear due to the continuous nature of the sine wave.

It was chosen to use \textit{The Raiders March} - the musical theme from the Indiana Jones movies, composed by John Williams. First of all, the piece contains a lot of variety in a short amount of time: it has both chords and beats and uses brass, string, and percussion instruments. The music also covers a wide frequency spectrum. The other reason was that it is one of the best-known musical themes, which made it easy for test participants to quickly recognize the difference between the original music and the modified music.

%The last reason is the fact that we knew beforehand that we would hear it a lot of times, since we needed a large amount of test participants. And for some reason, the Indiana Jones theme never gets old or annoying; you can hear it multiple times over and over again, which is a good thing since both the test participant and testers will hear it repeatedly.

\subsection{Phase 2 - Trying the App} \label{wizardTest}
The actual test of the program was conducted as a \textit{Wizard of Oz test} \cite{interactionDesign}. As the name suggests, this test is operated by a "wizard" - the one pulling the strings, acting as the computer/system. Wizard of Oz tests are widely used within the field of interaction design, especially in the earlier phases of a project where the full implementation is not yet completed. In this case, the wizard operated an external music production tool called \textit{Reason}, a full-fledged software suite with tons of tools and effects. This was hidden from the participants, since the focus was on testing the group's interface, not \textit{Reason}'s.

Since the test participant solely looked at Audio Effect Box without necessarily knowing what happened "behind the curtain", a wizard was used. Put another way: the program that the test participant interacted with did not have any real functionality, but was only a graphical interface with sliders and buttons to provide feedback. The actual audio processing and tweaking of effects were done on a separate computer running the \textit{Reason} software. Figure \ref{fig:testSetup} illustrates the setup used for the test.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.60\textwidth]{images/Test/Test_setup}
\caption{Illustration of the test setup.}
\label{fig:testSetup}
\end{figure}

In the second phase, each participant was asked to try out Audio Effect Box. He would interact only with the computer running the Audio Effect Box program, while the wizard would operate the \textit{Reason} program. The test participant was told that he had 10 minutes to play around with the program and that he should try out all of the effects within the time limit. Other than that, he received no further instructions; the idea was that he should be able to interact with the program without any help.

The test participant would then start clicking on various effects and adjusting parameters via sliders. It was then the wizard's job to adjust the same parameters inside \textit{Reason} as fast as possible. For the wizard to know what the test participant was looking at, a separate computer running a program called \textit{TeamViewer} \cite{teamviewerWeb} was used. \textit{TeamViewer} basically transmits one computer screen to another via the network. This made it possible to mirror the test participant's screen to another computer, so the wizard could see when the participant was moving the cursor around the screen. A tiny delay was introduced due to the nature of \textit{TeamViewer} running over the network; the test participant was therefore asked to be patient, since the program would operate slower than it normally would. That being said, the whole setup ran relatively smoothly. Only a few of the effects took some time to enable inside \textit{Reason}, such as the tremolo effect. When this occurred, the test participant was told to wait a few seconds before continuing.

In the setup, two external speakers were used to play the audio. The initial idea was to use headphones; however, that made it difficult for the wizard to operate, since he could not hear the effects himself. Using a minijack splitter, so that both the test participant and the wizard could hear the output, was also considered, but this introduced noise in the signal.

\subsubsection{Three Different Types of Interaction}
Since the hypotheses set out to examine the differences between having visual feedback, auditory feedback, or both at the same time, it was important to test people differently. Some of the test participants used the program with access to both audio (via speakers) and visuals (time domain and frequency domain visualized in the program). Others had access to audio only or visuals only. To make the test as balanced as possible, the same number of test participants was assigned to each interaction type: five people tried the combination of both (\textbf{AV}), five tried audio only (\textbf{A}), and five tried visuals only (\textbf{V}).

\subsubsection{Roles for the Testing}
When conducting the test, four roles were needed:

\textit{Test Participant}
The person who was tested using the program. The person had no previous experience with using Audio Effect Box.

\textit{Facilitator}
The person who was responsible for guiding and helping the test participant. He had a pre-written manuscript that was read out loud to ensure that every test participant was treated the same and received the same information in the same way. In fact, the test was almost conducted as what is known as a \textit{double-blind test}, where neither the test participant nor the tester knows the correct answers. This is often used in placebo tests, e.g. when testing a medical drug. Here, neither the test participant nor the facilitator knew the correct answers to the audio effect questions, since the wizard handled the playback of the correct audio effects. This scenario is similar to how the TV program \textit{Who Wants to Be a Millionaire?} is structured, where the host does not know the answers beforehand either. This made it impossible for the test participant to get any help from, or "read" the answer in, the facilitator's body language.

\textit{Wizard}
The person responsible for the actual audio effects. In an optimal setting this person should not be placed in the same room as the test participant. This would ensure that the setup did not feel artificial or fake. Unfortunately, this was not possible, so instead the two were placed with their backs/sides to each other.

\textit{Wizard's Helper}
Since \textit{Reason} is a relatively complex program, sometimes it would be hard for the wizard to operate it fast enough himself. Therefore a second person was seated next to him to help with various tasks. This person also took notes.

\subsection{Phase 3 - The After-Test}
Since the test was conducted as a before-and-after test, it was necessary to make a third phase that tested the same aspects as the first phase. The difference was that the test participant had now tried out the program - and, hopefully, improved his skills.

The procedure for the third phase was exactly the same as the first phase, just with a new set of 15 randomized questions.

\section{General Observations from the Test}
The following will describe some general thoughts and comments that were written down during the test sessions, as well as potential \textit{nuisance factors} (sources of errors) \cite{MM9}.

\textbf{People did not understand the effect of tweaking multiple filters at once}

Most of the test participants started out by adjusting a single slider, going from one extreme (turned off) to the other (fully turned on). Many test participants did this: first they tried changing slider one back and forth; then they turned it off and went on to try slider two; then they turned slider two off and went to slider three, etc.

However, many of the effects do not express an immediate result from adjusting a single parameter. An example of this was the echo effect: the first sliders controlled the parameters \textit{delay time}, \textit{diffusion} and \textit{decay}. However, if the \textit{dry/wet} parameter was turned all the way down, it appeared as if the previous parameters had no impact at all. This happened multiple times, and it was obvious that people were confused by the lack of feedback (changing a slider in some cases had neither an auditory nor a visual result until another parameter was tweaked). It would have been optimal to make clear which sliders were important and why the test participants needed to change multiple sliders at once. There should be some immediate feedback showing that the program was indeed working; otherwise, people would be in doubt whether or not it was working properly.

\textbf{People became tired for the third phase}

Since the test lasted approximately 30 minutes (10 minutes in each of the three phases), the test participants clearly had a tendency to become unfocused and a little bored by the third phase. They had to listen to/watch the same audio clips over and over again (swapping between the Indiana Jones theme and a sine wave), which could become tiresome. This could have a negative impact on how they answered in the third phase compared to the first phase. Before going to the first phase, the participants did not know what to expect. But when going to the third phase, they knew they had to listen to/watch 15 audio effects.

\textbf{Confusions between music and sine wave}

It was decided not to tell the participants about the true goal of the test: examining whether having only audio or only visuals was better than having both. This led to some confusion when participants in the second phase tried out the program with only audio or only visuals. The program has two buttons that allow switching between playing the Indiana Jones music and a sine wave. However, one participant thought these buttons were the same as choosing between either hearing or seeing the audio. This was because the same distinction had occurred in the first phase, where he would always \textbf{hear} the Indiana Jones music and \textbf{see} the sine wave. This confused him in the second phase, since he happened to fall into the category of only having auditory feedback. It could be argued that it would have been better to explain beforehand that he was only going to have auditory feedback, given the goal of the test.

\textbf{Did not understand the goal of the test}

Before each test session, the facilitator read the manuscript out loud, describing the test (see appendix \ref{manuscript}). It was decided to inform the participants that there would be three phases, but each phase would only be described when it was about to be conducted, i.e. the participant would first hear about phase 1 and then go through it. Afterwards, he would hear about phase 2 and go through it. Lastly, he would hear about and go through phase 3. It was intentionally not made clear that it was a before-and-after test and that the goal was to use phase 2 to become better at the test in phase 3. This is akin to studying only to become better at an exam; it has an artificial or shallow feel to it. Instead, it was hoped that the participants would simply try their best without thinking too much about becoming better at the second test in phase 3.

However, it turned out that some of the participants did not like this approach and became confused by it. Only late into the test did they realize that they were supposed to get better for the second test in phase 3, which frustrated some of them.

\textbf{Didn't try sine wave and music with all effects in phase 2}

One thing observed early on in the second phase, with participants who had either audio (A) or audio and visuals (AV), was that within the 10 minutes available they did not listen to the sine wave and the music equally.

Before the participants tried out the program, it was explained that they could switch between listening to a sine wave and the music from Indiana Jones by pressing one of two buttons. However, it turned out that most participants chose one (most often the music, since it was more pleasant to listen to) and then continued with that for the duration of the test. If this happened, the facilitator would remind them after approximately five minutes that they could switch between the two. Despite this, it was clear that many participants heard a specific audio effect with only the music or only the sine wave. This was problematic, since some effects are easier to hear with the sine wave and others with the music; optimally, they would listen to both versions.

However, in a real-life situation, it would be up to the users to decide what they found most relevant to use. Therefore, the group didn't want to force them to hear each audio effect both with music and with sine wave. In the end, a typical pattern in phase 2 of the test would look something like the following:

\begin{itemize}
\item Minutes 0-4: test participant hears the effect using the Indiana Jones theme
\item Minutes 5-6: facilitator reminds participants that they can switch between sine wave and music
\item Minutes 6-7: test participant switches back and forth between sine wave and music
\item Minutes 8-10: test participant almost only listens to the music and doesn't use the sine wave at all
\end{itemize}

This meant that some of the first effects (typically those placed the highest in the program's interface) were only heard via the music theme and not the sine wave. This might have contributed to an unintentional bias in the test results.

\textbf{Not everybody spent 10 minutes with the program}

Depending on what types of interaction the test participants had when trying the program (either audio only, visual only or audio and visual combined), they would be more or less engaged in using the program. Before the second phase started, they were told that they had 10 minutes available to try out the program. However, it turned out that some participants didn't need all of the time, while others thought it was too short. In general, people who had only access to the auditory part were less eager to continue for the full 10 minutes. Many of the audio-only participants spent only half of the time, approximately four or five minutes. Compared to this, people with only the visual part spent more time in general, about eight to ten minutes. Those with both audio and visual were more eager to spend the full 10 minutes for the test.

If a test participant declared that he was finished before the 10-minute mark, the facilitator would inform him that he had more time left, but would not force the participant to continue if he didn't want to.

This shows that different types of stimuli are more or less engaging. Apparently, having audio only is less interesting than having visuals only.

\textbf{Some effects were hard to distinguish from each other}

Many test participants found it hard to distinguish tremolo and vibrato from each other, especially if they only heard the music and did not see the sine wave. The same applies to the chorus and echo effects. Apparently, the chorus effect was difficult to identify.

\textbf{People were unsure when hearing the same effect twice}

If a test participant heard an effect multiple times during either phase 1 or phase 3, he would be in doubt about what to answer. It was intentionally not stated that the same effect could be presented multiple times. If participants had no clue what an effect was, only that they had heard it before, they often gave the same answer as before, despite what they had learned through the program in phase 2.

\textbf{Not an even number of A and V questions}

Initially, the plan was to ask 10 questions with audio and 10 with visuals (20 in total). However, a pilot test revealed that this took too long, and it was therefore decided to cut down to 15 questions instead. What was not realized at the time was that 15 is an odd number, meaning that there would be more audio-based questions than visual-based ones (eight audio versus seven visual). This only became clear when the group started analyzing the results. A solution would be to throw away one of the audio questions; however, it was decided to keep it, even though it might create an imbalance.

\textbf{Remembering effect name is not the same as understanding it}

It is hard to measure whether or not somebody understands and recognizes something. It was chosen to go with a simple approach of being able to identify a certain audio effect by its name. However, it turned out that some test participants did not focus so much on the effect's name; they did not remember what it was called, but still argued that they understood what it did.

Something else to keep in mind is where and how the name of each audio effect is displayed in the program. There are multiple "focus areas" in the program, and a test participant may look more at some areas than others. Figure \ref{fig:focusArea} shows some of the areas a person might focus on. An improved test might utilize eye tracking to examine where the participant is looking; however, this was outside of the project's scope.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{images/Test/focus_areas}
\caption{Areas that the test participant might focus on.}
\label{fig:focusArea}
\end{figure}

\textbf{Sometimes the program felt unresponsive}

Due to the nature of the test, using a Wizard of Oz approach, it sometimes took a little while to adjust the correct parameters inside \textit{Reason}. This led to an experience where the program did not always operate as smoothly as possible. However, the facilitator had already stated this before the second phase began.

Another thing to note is that the wizards naturally became better and faster after each test session. Although they had already practised in a few pilot tests, they kept improving over the many test sessions. This means that the first test participant experienced a less optimal and slower program than the last. That being said, there should not be too big a difference in the overall experience of using the program.

\textbf{The order of the test sessions should be randomized}

Along the same lines, one thing that could have been done differently is the order of the test participants' interaction types (A, V or AV). The group ended up testing the first five people in AV, then three in A, then three in V, then two in A and, lastly, two in V. As stated above, the wizards naturally became better at operating \textit{Reason}. It could therefore be imagined that the last participants had an advantage over the first. Since all of the AV test participants came at the beginning, they may have had a harder time because the program overall felt less responsive than what the later participants experienced.

The optimal solution would be to spread out and randomize the interaction types. This would ensure that all three types were tested in the beginning (when the wizards were less experienced), the middle (when the wizards had become better) and the end (when the wizards were at their best).
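Such a spread-out ordering could have been generated up front rather than decided ad hoc. A minimal sketch of a balanced, randomized schedule (the function name and seed handling are illustrative; the counts match the 5 + 5 + 5 participants of this test):

```python
import random

def balanced_schedule(types=("A", "V", "AV"), per_type=5, seed=None):
    """Build a session order in which each interaction type appears
    equally often, then shuffle so the types are spread across the
    whole test period rather than clustered at the start."""
    schedule = [t for t in types for _ in range(per_type)]
    random.Random(seed).shuffle(schedule)
    return schedule
```

Shuffling the full list keeps the design balanced while making it unlikely that any one interaction type is tested only while the wizards are at a particular skill level.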

\section{Looking at the Test Results}
As stated earlier, 15 test samples were gathered from fellow Medialogy students. Of those 15, one was female and the rest were male. Their ages ranged from 20 to 32 years and spanned seven different groups.

All of the data were written down on a piece of paper for each test participant (see appendix \ref{testPaper}) and later imported into a digital spreadsheet. Figure \ref{fig:spreadSheet} shows how the data looked in the spreadsheets. Each individual question had its own row, meaning that there were $15 \times 2$ rows for each test participant (15 questions before and 15 questions after). Each participant was assigned an ID, and data such as age and group number were noted down, as well as who conducted the test. The conductor was noted because there might be a difference depending on who conducted the test (the facilitators and wizards switched roles after a few tests, since conducting it many times in a row was exhausting). The interaction type (A, V or AV) was noted down, as was the type of each individual question.

\begin{figure}[htbp]
\centering
\includegraphics[width=1\textwidth]{images/Test/Results_spreadsheet}
\caption{How the data was structured in a spreadsheet.}
\label{fig:spreadSheet}
\end{figure}

Lastly, the participant's answer was written down, as well as the correct answer. Since there were multiple versions of each effect (such as \textit{Echo1}, \textit{Echo2}, \textit{Echo3}), a boolean value was also used to show whether or not the answer was correct. It was chosen to keep the specific effect name: instead of just writing "Distortion", the whole name, "Distortion2", was written down, in case it turned out that some versions were easier to distinguish than others. Also, if somebody answered "equalizer", this would still count as a correct answer if the specific example was e.g. a highpass filter. The same goes for answering "delay"/"echo" and "chorus"/"flanger" - these were considered the same in this test.
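The scoring rule described above - stripping the version suffix and accepting synonyms such as "delay" for "echo" - can be sketched as follows. The synonym table mirrors the equivalences named in this chapter; the function itself is illustrative, not the group's actual spreadsheet formula:

```python
import re

# Answers that counted as the same effect in the test.
SYNONYMS = {
    "delay": "echo",
    "flanger": "chorus",
    "lowpass filter": "equalizer",
    "highpass filter": "equalizer",
    "bandpass filter": "equalizer",
}

def is_correct(answer, presented):
    """Compare an answer such as 'delay' against the noted effect
    name such as 'Echo3' (version suffix stripped before comparing)."""
    target = re.sub(r"\d+$", "", presented).strip().lower()
    given = answer.strip().lower()
    return SYNONYMS.get(given, given) == SYNONYMS.get(target, target)
```

Normalizing both sides through the same synonym table keeps the boolean column consistent no matter which of the equivalent names the participant happened to use.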

\subsection{Analyzing the Data}
When analyzing the data, it was necessary to convert the spreadsheet's long list of answers into rows where the correct answers had been extracted and counted. This made it possible to compare the different cells with each other to see whether there had been any improvement (see figure \ref{fig:Spreadsheet_procent}). Sorting the data like this made it possible to compare the A, V and AV participants with each other, which is later used to accept or reject the null hypotheses.
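The conversion from the long list of answers to per-participant counts can be sketched with plain Python. The row shape used here, (participant ID, phase, correct), is a simplified stand-in for the actual spreadsheet columns:

```python
from collections import defaultdict

def count_correct(rows):
    """Collapse long-format answer rows of the shape
    (participant_id, phase, correct) into a correct-answer count
    per participant and phase."""
    totals = defaultdict(int)
    for pid, phase, correct in rows:
        totals[(pid, phase)] += int(correct)
    return dict(totals)

def improvement(totals, pid):
    """Score difference between the after-test and the before-test;
    positive means the participant improved."""
    return totals[(pid, "after")] - totals[(pid, "before")]
```

The same aggregation was done in the spreadsheet itself; the point is only that each participant ends up with one before-score, one after-score, and their difference.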

\begin{figure}[htbp]
\centering
\includegraphics[width=1\textwidth]{images/Test/Spreadsheet_procent}
\caption{The data stored for each participant.}
\label{fig:Spreadsheet_procent}
\end{figure}

After the sorting, it was possible to make a chart of the participants who improved and those who did not. This chart can be seen in figure \ref{fig:Interaction_chart}, where all the participants who improved in the test are shown in blue and those who became worse are shown in red. From this it is already possible to see that the only sample group with improvements was group V. The other two did not improve.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{images/Test/Interaction_chart}
\caption{Chart over improvement of the participants.}
\label{fig:Interaction_chart}
\end{figure}

After seeing that the V sample had the biggest improvements, it was interesting to see where the different test participants had answered the most questions correctly in phase 1 and phase 3 (see figure \ref{fig:BnA_total}). The figure shows which test participants improved and which did not. It also shows no clear indication that the wizards' improvement had an impact on the participants' knowledge about audio effects gained through the use of the program.

\begin{figure}[htbp]
\centering
\includegraphics[width=1\textwidth]{images/Test/BnA_total}
\caption{Before-and-after results total.}
\label{fig:BnA_total}
\end{figure}

It was also possible to see the answers, in percentages, for both the A and V questions from the before- and after-test (see figure \ref{fig:BnA_split}). From the chart it was possible to see that the improvement in the A answers was more prominent than in the V sample, since three out of five had improved. The results of A and AV combined showed that only four out of ten had improved.

\begin{figure}[htbp]
\centering
\includegraphics[width=1\textwidth]{images/Test/BnA_split}
\caption{Before-and-after results - A and V split up.}
\label{fig:BnA_split}
\end{figure}

To compare the test data with the null hypotheses, the data was illustrated as box-and-whisker plots (see figures \ref{fig:before} and \ref{fig:after}). The distances between the quartiles have to be of approximately the same size to indicate a normal distribution.
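The quartile comparison read off the boxplots can also be computed directly. A small sketch using Python's standard library (what counts as "approximately the same size" is a judgment call, so no threshold is hard-coded):

```python
from statistics import quantiles

def quartile_distances(scores):
    """Return (Q2 - Q1, Q3 - Q2). Roughly equal distances suggest a
    symmetric distribution, as read off a box-and-whisker plot."""
    q1, q2, q3 = quantiles(scores, n=4)
    return q2 - q1, q3 - q2
```

For a perfectly symmetric sample, the two distances are equal; a large difference between them would indicate skew.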

\begin{figure}[htbp] \centering
\begin{minipage}[b]{0.45\textwidth} \centering
\includegraphics[width=1\textwidth]{images/Test/Before} % Venstre billede
\end{minipage} \hfill
\begin{minipage}[b]{0.45\textwidth} \centering
\includegraphics[width=1\textwidth]{images/Test/After} % Højre billede
\end{minipage} \\ % Captions og labels
\begin{minipage}[t]{0.45\textwidth}
\caption{Boxplot of the before-test.} % Venstre caption og label
\label{fig:before}
\end{minipage} \hfill
\begin{minipage}[t]{0.45\textwidth}
\caption{Boxplot of the after-test.} % Højre caption og label
\label{fig:after}
\end{minipage}
\end{figure}

\textbf{Kruskal-Wallis Analysis}

The boxplots indicate that the data is approximately normally distributed. A \textit{Kruskal-Wallis analysis} was then performed; this is a non-parametric method of analysing variance that indicates whether the samples originate from the same distribution (it does not itself require normally distributed data). The calculation was done in a program called \textit{R} \cite{r}. The p-value calculated from the before-test scores is \textit{0.48}, and the p-value from the after-test scores is \textit{0.55}. The p-value calculated from the difference in scores between the two tests is \textit{0.61}. The p-value is used to accept or reject the null hypotheses \cite{MM9}: if the p-value is above the 0.05 significance level, the null hypothesis cannot be rejected.
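The analysis itself was run in \textit{R}, but the same test can be reproduced in Python via \texttt{scipy.stats.kruskal} (assuming SciPy is available). The scores below are hypothetical placeholders, not the project's actual data:

```python
from scipy.stats import kruskal

# Hypothetical before-test scores for the three interaction types
# (the real data lives in the project's spreadsheet).
scores_A = [5, 6, 4, 7, 5]
scores_V = [6, 8, 7, 9, 6]
scores_AV = [5, 7, 6, 6, 8]

# H statistic and p-value; the null hypothesis of identical
# distributions is rejected only if the p-value falls below 0.05.
stat, p_value = kruskal(scores_A, scores_V, scores_AV)
print(f"H = {stat:.3f}, p = {p_value:.3f}")
```

In \textit{R}, the equivalent call is \texttt{kruskal.test()} on the score lists grouped by interaction type.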

%%This means that none of the null hypotheses has not been rejected and should be considered.

%If P value is smaller --> less probability of null hypothesis to be correct --> higher probability of rejecting the null hypothesis

%If P value is greater --> more probability of null hypothesis to be correct --> less probability of rejecting the null hypothesis%%

\section{Interpretation of Results}

From the results of the analysis, it can be concluded that the group cannot reject any of the null hypotheses; none of the scores or score differences differ sufficiently from each other. The only null hypothesis close to being rejectable is 3a, because the final score of the participants using the program with only visuals was slightly better than the score of those using it with only audio - but this difference was not sufficient.

To conclude properly on this project, though, many more test participants would be needed. Having only five participants per version of the program proved far too few and made the analysis of the results almost inconclusive.

Every question in the test required an answer, so participants who did not know were forced to guess. This automatically introduces an error margin into the results. Because of this margin, the scores would need to differ even more between the two tests before any conclusion about improvement or worsening could be drawn.

Furthermore, the test lacked a control group to ensure the elimination of undesired factors.